Last compiled: 2021-01-05
Machine Learning is a necessary concept that opens a new world of possibilities in data analysis. I won’t bore you with the theory, terminology, core algorithms, and application since there are plenty of online resources where the topic and subtopics are covered in detail. The scope of this session is supervised machine learning. Moreover, I will focus on two types of linear models:Other models such as tree-based models and support vector machines are common in machine learning; however, it is not the scope of this session.
Goal
The objective of this session is to build, evaluate, and visualize a machine learning linear regression model. Moreover, this session will follow these steps:
I decided to fit two different linear regression models: (1) linear model and (2) GLMNET model for comparison and to show how one can perform better than the other.
I will be using two data sets: bike_orderlines for data exploration and bike_features_tbl for modeling (source of raw data is linked below). You may download the data in case you want to try this code on your own.
Raw data source:
Download bike_orderlines_and_features_data.zip
As a first step, please load tidyverse and tidymodels libraries. For details on what these libraries offer, please refer to the comments in the code block below.
# Load Libraries
library(tidyverse)
# library(tibble) --> is a modern re-imagining of the data frame
# library(readr) --> provides a fast and friendly way to read rectangular data like csv
# library(dplyr) --> provides a grammar of data manipulation
# library(magrittr) --> offers a set of operators which make your code more readable (pipe operator)
# library(tidyr) --> provides a set of functions that help you get to tidy data
# library(stringr) --> provides a cohesive set of functions designed to make working with strings as easy as possible
# library(ggplot2) --> graphics
library(tidymodels)
# library(rsample) --> provides infrastructure for efficient data splitting, resampling and cross validation.
# library(parsnip) --> provides an API to many powerful modeling algorithms in R.
# library(recipes) --> tidy interface to data pre-processing (making statistical transformations) tools for feature engineering (prior to modeling).
# library(workflows) --> bundle your pre-processing, modeling, and post-processing together.
# library(tune) --> helps you optimize the hyperparameters of your model and pre-processing steps.
# library(yardstick) --> measures the effectiveness of models using performance metrics (metrics for model comparison).
# library(broom) --> converts the information in common statistical R objects into user-friendly, predictable formats.
# library(dials) --> creates and manages tuning parameters and parameter grids.
If you haven’t installed these packages, please install them by calling install.packages([name_of_package]) in the R console. After installing, run the above code block again.
# Load Data
bike_orderlines_tbl <- readRDS("00_raw_data/bike_orderlines.rds")
bike_features_tbl <- readRDS("00_raw_data/bike_features_tbl.rds")
# Glimpse the data
bike_orderlines_tbl %>% glimpse()
## Rows: 15,644
## Columns: 18
## $ order_id <dbl> 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 9…
## $ order_line <dbl> 1, 2, 1, 2, 1, 2, 3, 4, 5, 1, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1,…
## $ order_date <dttm> 2015-01-07, 2015-01-07, 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10, 2015-0…
## $ model <chr> "Spectral CF 7 WMN", "Ultimate CF SLX Disc 8.0 ETAP", "Neuron CF 8", "Speedmax CF 7.0", "Stitched 360 P…
## $ model_year <dbl> 2021, 2020, 2021, 2019, 2020, 2020, 2020, 2021, 2020, 2020, 2020, 2020, 2021, 2020, 2021, 2020, 2021, 2…
## $ category_1 <chr> "Mountain", "Road", "Mountain", "Road", "Mountain", "Hybrid / City", "Road", "Road", "Mountain", "Road"…
## $ category_2 <chr> "Trail", "Race", "Trail", "Triathlon Bike", "Dirt Jump", "City", "Triathlon Bike", "Cyclocross", "Endur…
## $ category_3 <chr> "Spectral", "Ultimate", "Neuron", "Speedmax", "Stitched", "Roadlite", "Speedmax", "Inflite", "Torque", …
## $ price <dbl> 3119, 5359, 2729, 1749, 1219, 1359, 2529, 1559, 3899, 6629, 2919, 4089, 2729, 2239, 3409, 1169, 3219, 1…
## $ quantity <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 3, 1…
## $ total_price <dbl> 3119, 5359, 2729, 1749, 1219, 1359, 2529, 1559, 3899, 6629, 2919, 8178, 2729, 2239, 3409, 1169, 3219, 1…
## $ frame_material <chr> "carbon", "carbon", "carbon", "carbon", "aluminium", "carbon", "carbon", "aluminium", "carbon", "carbon…
## $ weight <dbl> 13.80, 7.44, 14.06, 8.80, 11.50, 8.80, 8.20, 8.85, 14.40, 6.50, 21.25, 7.35, 8.91, 15.75, 8.88, 8.20, 1…
## $ url <chr> "https://www.canyon.com/en-de/mountain-bikes/trail-bikes/spectral/spectral-cf-7-wmn/2681.html", "https:…
## $ bikeshop <chr> "AlexandeRad", "AlexandeRad", "WITT-RAD", "WITT-RAD", "fahrschneller", "fahrschneller", "fahrschneller"…
## $ location <chr> "Hamburg, Hamburg", "Hamburg, Hamburg", "Bremen, Bremen", "Bremen, Bremen", "Stuttgart, Baden-Württembe…
## $ lat <dbl> 53.57532, 53.57532, 53.07379, 53.07379, 48.78234, 48.78234, 48.78234, 48.78234, 48.78234, 48.37154, 51.…
## $ lng <dbl> 10.015340, 10.015340, 8.826754, 8.826754, 9.180819, 9.180819, 9.180819, 9.180819, 9.180819, 10.898514, …
bike_features_tbl %>% glimpse()
## Rows: 231
## Columns: 67
## $ bike_id <dbl> 2875, 2873, 2874, 2876, 2877, 2225, 2091, 2086, 2088, 2120, 2087, 2394, 2386, 2393, 2385, …
## $ model <chr> "Aeroad CF SL Disc 8.0 Di2", "Aeroad CF SLX Disc 9.0 ETAP", "Aeroad CF SLX Disc 9.0 Di2", …
## $ model_year <dbl> 2020, 2020, 2020, 2020, 2020, 2019, 2019, 2021, 2020, 2020, 2020, 2020, 2020, 2020, 2020, …
## $ frame_material <chr> "carbon", "carbon", "carbon", "carbon", "carbon", "carbon", "carbon", "carbon", "carbon", …
## $ weight <dbl> 7.60, 7.27, 7.10, 7.73, 7.83, 6.80, 6.80, 7.60, 7.30, 7.20, 7.30, 8.72, 8.73, 9.04, 8.97, …
## $ price <dbl> 4579, 6919, 6429, 5069, 3609, 6139, 5359, 2629, 3699, 3219, 2829, 1559, 1559, 1269, 1269, …
## $ category_1 <chr> "Road", "Road", "Road", "Road", "Road", "Road", "Road", "Road", "Road", "Road", "Road", "R…
## $ category_2 <chr> "Race", "Race", "Race", "Race", "Race", "Race", "Race", "Race", "Race", "Race", "Race", "E…
## $ category_3 <chr> "Aeroad", "Aeroad", "Aeroad", "Aeroad", "Aeroad", "Aeroad", "Aeroad", "Aeroad", "Aeroad", …
## $ gender <chr> "unisex", "unisex", "unisex", "unisex", "unisex", "unisex", "unisex", "unisex", "unisex", …
## $ url <chr> "https://www.canyon.com/en-de/road-bikes/race-bikes/aeroad/aeroad-cf-sl-disc-8.0-di2/2875.…
## $ Frame <chr> "Canyon Aeroad CF SL Disc", "Canyon Aeroad CF SLX Disc", "Canyon Aeroad CF SLX Disc", "Can…
## $ Fork <chr> "Canyon FK0041 CF SLX Disc", "Canyon FK0041 CF SLX Disc", "Canyon FK0041 CF SLX Disc", "Ca…
## $ `Rear Derailleur` <chr> "Shimano Ultegra Di2 R8050 SS", "SRAM RED eTap AXS, 12-speed", "Shimano Dura-Ace Di2 R9150…
## $ `Front Derailleur` <chr> "Shimano Ultegra Di2 R8050", "SRAM RED eTap AXS", "Shimano Dura-Ace Di2 R9150", "SRAM Forc…
## $ Cassette <chr> "Shimano Ultegra R8000, 11-speed, 11-28T", "SRAM RED XG-1290, 12-speed, 10-28", "Shimano D…
## $ Crank <chr> "Shimano Ultegra R8000", "SRAM RED D1", "Shimano Dura-Ace R9100 PM", "SRAM Force D1", "Shi…
## $ `Bottom bracket` <chr> "Shimano Pressfit BB72", "SRAM Pressfit RED DUB", "Shimano Pressfit BB92", "SRAM Pressfit …
## $ `Thru Axle` <chr> "Canyon Thru Axle", "Canyon Thru Axle", "Canyon Thru Axle", "Canyon Thru Axle", "Canyon Th…
## $ Cockpit <chr> "Canyon H36 Aerocockpit CF", "Canyon H36 Aerocockpit CF", "Canyon H36 Aerocockpit CF", "Ca…
## $ Saddle <chr> "Selle Italia SLR", "Selle Italia SLR", "Selle Italia SLR", "Selle Italia SLR", "Selle Ita…
## $ Seatpost <chr> "Canyon S27 Aero VCLS CF", "Canyon S27 Aero VCLS CF", "Canyon S27 Aero VCLS CF", "Canyon S…
## $ Pedals <chr> "None included", "None included", "None included", "None included", "None included", "None…
## $ `Derailleur hanger` <chr> "Shop Derailleur Hanger GP0211-01", "Shop Derailleur Hanger GP0211-01", "Shop Derailleur H…
## $ Battery <chr> "", "SRAM eTap Powerpack", "", "SRAM eTap Powerpack", "", "SRAM eTap Powerpack", "", "", "…
## $ Brake <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ `Shift Lever` <chr> "", "", "", "", "", "", "Shimano Di2 Remote Sprinter Shifter", "", "", "", "", "", "", "",…
## $ Chain <chr> "", "", "", "", "", "", "Shimano CN-HG901 11s", "", "", "", "", "Shimano CN-HG601 11s", "S…
## $ Stem <chr> "", "", "", "", "", "", "", "Canyon V13", "", "", "", "Canyon V15", "Canyon V15", "Canyon …
## $ Handlebar <chr> "", "", "", "", "", "", "", "Canyon H16 Aerobar AL", "", "", "", "Canyon H17 Ergobar AL", …
## $ Headset <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ Motor <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ `Battery Charger` <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ `Flat Pedals` <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ Chainguard <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ `Aero Bar` <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ `Brake Lever / Master` <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ `Wheel Tire System` <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ `Suspension Fork` <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ `Disc Brake` <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ Grips <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ Chainring <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ Display <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ Modeswitch <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ `Rear Shock` <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ Light <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ Fender <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ `Bike Racks` <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ `Brake 1` <chr> "", "", "", "", "", "SRAM S-900 Direct Mount", "Shimano Dura-Ace R9110", "Shimano 105 R701…
## $ `Brake 2` <chr> "", "", "", "", "", "SRAM S-900 Direct Mount", "Shimano Dura-Ace R9110", "Shimano 105 R701…
## $ `Shift-/ Brake Lever 1` <chr> "Shimano Ultegra Di2 R8070, 11-speed", "SRAM RED eTap AXS HRD, 12-speed", "Shimano Dura Ac…
## $ `Shift-/ Brake Lever 2` <chr> "Shimano Ultegra Di2 R8070, 11-speed", "SRAM RED eTap AXS HRD, 12-speed", "Shimano Dura Ac…
## $ `Wheel 1` <chr> "DT Swiss ARC 1400 Dicut", "DT Swiss ARC 1100 Dicut", "DT Swiss ARC 1100 Dicut", "DT Swiss…
## $ `Wheel 2` <chr> "DT Swiss ARC 1400 Dicut", "DT Swiss ARC 1100 Dicut", "DT Swiss ARC 1100 Dicut", "DT Swiss…
## $ `Tyre 1` <chr> "Continental Grand Prix 5000 / Attack 23 mm", "Continental Grand Prix 5000 / Attack 23 m…
## $ `Tyre 2` <chr> "Continental Grand Prix 5000, 25 mm", "Continental Grand Prix 5000, 25 mm", "Continental G…
## $ `Handlebar Tape 1` <chr> "Canyon Ergospeed Gel", "Canyon Ergospeed Gel", "Canyon Ergospeed Gel", "Canyon Ergospeed …
## $ `Handlebar Tape 2` <chr> "Canyon bar-end plug", "Canyon bar-end plug", "Canyon bar-end plug", "Canyon bar-end plug"…
## $ `Manuals and Accessories 1` <chr> "Canyon tool case", "Canyon tool case", "Canyon tool case", "Canyon tool case", "Canyon to…
## $ `Manuals and Accessories 2` <chr> "DT Swiss warranty & intended use manual", "CAYN INSTRUCTION SRAM ETP", "DT Swiss warranty…
## $ `Manuals and Accessories 3` <chr> "Canyon starter box", "Canyon starter box", "Canyon starter box", "Canyon starter box", ""…
## $ `Manuals and Accessories 4` <chr> "", "", "", "", "", "", "", "", "", "BAG REY ACC STRIKE C RIM BRAKE", "", "", "", "", "", …
## $ `Manuals and Accessories 5` <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ `Manuals and Accessories 6` <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ `Manuals and Accessories 7` <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ `Manuals and Accessories 8` <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ `Brake Rotor` <list> ["Shimano RT800", "SRAM Centerline X", "Shimano RT900", "SRAM Centerline X", "Shimano RT8…
# Explore Data
model_sales_tbl <- bike_orderlines_tbl %>%
select(total_price, model, category_2, frame_material) %>%
group_by(model, category_2, frame_material) %>%
summarise(total_sales = sum(total_price)) %>%
ungroup() %>%
arrange(desc(total_sales))
# Visualize
model_sales_tbl %>%
mutate(category_2 = as_factor(category_2) %>%
fct_reorder(total_sales, .fun = max) %>%
fct_rev()) %>%
ggplot(aes(frame_material, total_sales)) +
geom_violin() +
geom_jitter(width = 0.1, alpha = 0.5, color = "#2c3e50") +
#coord_flip() +
facet_wrap(~ category_2) +
scale_y_continuous(labels = scales::dollar_format(scale = 1e-6, suffix = "M", accuracy = 0.1)) +
tidyquant::theme_tq() +
labs(title = "Total Sales for Each Bike Model",
x = "Frame Material", y = "Revenue")
# Select the features of interest
bike_features_tbl <- bike_features_tbl %>%
select(model:gender, 'Rear Derailleur', 'Shift Lever')
# Create features with the recipes package
set.seed(1234)
split_obj <- rsample::initial_split(bike_features_tbl, prop = 0.75)
train_tbl <- training(split_obj)
test_tbl <- testing(split_obj)
# Remove spaces and dashes from the column names
train_tbl <- train_tbl %>% set_names(str_replace_all(names(train_tbl), " |-", "_"))
test_tbl <- test_tbl %>% set_names(str_replace_all(names(test_tbl), " |-", "_"))
# Create the model recipe
bike_recipe <-
recipe(price ~ ., data = train_tbl %>% select(-c(model:weight), -category_1, -c(category_3:gender))) %>%
step_novel(all_predictors(), -all_numeric()) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_zv(all_predictors())
# View summary of recipe to ensure all necessary steps are laid out before model fitting
bike_recipe %>% summary()
# Bundle the model and recipe with the workflow package and do fitting
bike_fit <- function(model, recipe, train_data) {
workflow() %>%
add_model(model) %>%
add_recipe(recipe) %>%
fit(train_data)
}
# Build a Complex Linear Regression Model
lr_model <- linear_reg("regression") %>%
set_engine("lm")
# Fit data to linear regression model
bike_fit_lm <- lr_model %>% bike_fit(recipe = bike_recipe,
train_data = train_tbl)
# Build an Elastic Net GLM Regularized Regression Model
glmnet_model <- linear_reg(mode = "regression",
penalty = 10,
mixture = 0.1) %>%
set_engine("glmnet")
# Fit data to GLMNET model
bike_fit_glm <- glmnet_model %>% bike_fit(recipe = bike_recipe,
train_data = train_tbl)
# Function to evaluate model with the yardstick package
calc_metrics <- function(model, test_data) {
model %>%
predict(test_data) %>%
bind_cols(test_data %>% select(price)) %>%
yardstick::metrics(truth = price, estimate = .pred)
}
# Compare test data predictions with actual values to get baseline model performance
# Evaluate linear regression model metrics
bike_fit_lm %>% calc_metrics(test_data = test_tbl)
# Evaluate GLMNET model metrics
bike_fit_glm %>% calc_metrics(test_data = test_tbl)
# Visualize Feature Importance via Complex Linear Regression Model
bike_fit_lm %>%
pull_workflow_fit() %>%
tidy() %>% na.omit() %>%
arrange(p.value) %>%
mutate(term = as_factor(term) %>% fct_rev()) %>%
ggplot(aes(x = estimate, y = term)) +
geom_point(color = "firebrick", size = 3) +
ggrepel::geom_label_repel(aes(label = scales::dollar(estimate, accuracy = 1, suffix = " €", prefix = "")),
size = 4, fill = "white", color = "#2E2E33", direction = "both", nudge_x = T, nudge_y = T) +
scale_x_continuous(labels = scales::dollar_format(suffix = " €", prefix = "")) +
theme_minimal() +
labs(title = "Linear Regression: Feature Importance",
subtitle = "Complex Linear Regression Model",
x = "Estimate",
y = "Feature")
# Visualize Feature Importance via GLMNET Linear Regression Model
bike_fit_glm %>%
pull_workflow_fit() %>%
tidy() %>%
# No p value here
arrange(desc(abs(estimate))) %>%
mutate(term = as_factor(term) %>% fct_rev()) %>%
ggplot(aes(x = estimate, y = term)) +
geom_point(color = "firebrick", size = 3) +
ggrepel::geom_label_repel(aes(label = scales::dollar(estimate, accuracy = 1, suffix = " €", prefix = "")),
size = 4, fill = "white", color = "#2E2E33", direction = "both", nudge_x = T, nudge_y = T) +
scale_x_continuous(labels = scales::dollar_format(suffix = " €", prefix = "")) +
theme_minimal() +
labs(title = "Linear Regression: Feature Importance",
subtitle = "Elastic Net GLM Regularized Regression Model",
x = "Estimate",
y = "Feature")
R!